fix(instrumentation-runtime-node): use absolute results in eventLoopUtilization computation #3118

mveroone · 2025-09-27T14:25:18Z

Which problem is this PR solving?

The current implementation of the Event Loop Utilization metric passes a delta value to the call instead of an absolute one.
The Node.js perf_hooks documentation is a little ambiguous, but it does say that when eventLoopUtilization() is called with one argument, that argument should be the result of a call to the same function without arguments:

utilization1 <Object> The result of a previous call to eventLoopUtilization().

(Emphasis is mine)

The result of this bug is that the value tends to stabilize over time: because we pass a diff of a diff of a diff, we end up returning roughly the value since the start of the process instead of a delta since the last execution.
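
To make the "diff of a diff" effect concrete, here is a minimal sketch built on a simplified model of the subtraction that eventLoopUtilization() performs when it is given a previous reading. The `Elu` type, the `delta` helper, and the sample readings are made up for illustration; they are not taken from the instrumentation or from Node.js itself.

```typescript
// Simplified model: cumulative idle/active times (ms) since the loop started.
type Elu = { idle: number; active: number; utilization: number };

// Hypothetical helper mirroring "current minus previous" on two readings.
function delta(current: Elu, previous?: Elu): Elu {
  if (!previous) return current;
  const idle = current.idle - previous.idle;
  const active = current.active - previous.active;
  return { idle, active, utilization: active / (idle + active) };
}

const reading = (idle: number, active: number): Elu => ({
  idle,
  active,
  utilization: active / (idle + active),
});

// Made-up absolute readings: mostly idle at first, then fully busy twice.
const absolute: Elu[] = [reading(900, 100), reading(900, 200), reading(900, 300)];

// Buggy bookkeeping: the previous *delta* is stored and reused as the baseline.
function buggy(samples: Elu[]): number[] {
  let last: Elu | undefined;
  return samples.map(sample => {
    const elu = delta(sample, last);
    last = elu;
    return elu.utilization;
  });
}

// Fixed bookkeeping: the previous *absolute* reading is stored instead.
function fixed(samples: Elu[]): number[] {
  let last: Elu | undefined;
  return samples.map(sample => {
    const elu = delta(sample, last);
    last = sample;
    return elu.utilization;
  });
}

const buggyResults = buggy(absolute);
const fixedResults = fixed(absolute);
console.log({ buggyResults, fixedResults });
```

In this model the third buggy value drops to about 0.18 even though the loop was fully busy in that interval, while the fixed bookkeeping reports 1.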

Short description of the changes

Replaced the setting of the lastValue internal variable with a call to the argument-less perf_hook.
Given that this only queries internal counters, I believe it's light enough that we can afford to call it twice per tick. The alternative would be to bypass the automatic calculation of the deltas provided as a helper and compute the ratio ourselves with a couple of arithmetic operations.
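
For reference, the alternative mentioned above (computing the ratio by hand from two absolute readings) could look roughly like the sketch below. The function name `utilizationBetween` and the zero-elapsed fallback are assumptions for illustration, not code from this PR.

```typescript
// Shape of the absolute readings; only idle and active are needed here.
type EluReading = { idle: number; active: number };

// Hypothetical helper: utilization ratio between two absolute readings.
function utilizationBetween(previous: EluReading, current: EluReading): number {
  const idle = current.idle - previous.idle;
  const active = current.active - previous.active;
  const elapsed = idle + active;
  // Assumption: report 0 when no time has elapsed, to avoid dividing by zero.
  return elapsed > 0 ? active / elapsed : 0;
}

// Made-up readings: 50 ms busy and 150 ms idle between the two samples.
console.log(utilizationBetween({ idle: 100, active: 20 }, { idle: 250, active: 70 })); // prints 0.25
```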

Note

For reference, the Datadog library fixed the same bug last month, but they chose to bypass Node.js's automatic calculation of the utilization ratio and compute it themselves:
DataDog/dd-trace-js#6344
(line 259 in the new version of the file, search for "elu" if needed)

linux-foundation-easycla · 2025-09-27T14:25:25Z

The committers listed above are authorized under a signed CLA.

✅ login: david-luna / name: David Luna (0260d5d)
✅ login: mveroone / name: Maxime Véroone (a5fe644)

mveroone · 2025-10-02T15:21:16Z

@d4nyll CCLA Signed. Sorry for the delay.

Note: I'm available to discuss it if needed, ideally during European business hours, but I can arrange otherwise.

d4nyll · 2025-10-13T22:54:24Z

Hey @mveroone, thank you for raising the issue and apologies for taking my time on it. It is indeed a big bug.

In your fix, there is a span of code that the reported ELU metrics won't take into account:

const elu = eventLoopUtilizationCollector(this._lastValue);
// From here
observableResult.observe(elu.utilization);
this._lastValue = elu;
// To here
this._lastValue = eventLoopUtilizationCollector();

Whilst it's not such a big deal (it's only the timespan of running those two lines), it would be preferable to leave no time gaps with what's being captured.

Calling eventLoopUtilizationCollector() twice this way also calls process.hrtime() twice under the hood.

What do you think about this implementation instead?

const currentELU = eventLoopUtilizationCollector();
const deltaELU = eventLoopUtilizationCollector(currentELU, this._lastValue);
this._lastValue = currentELU;
observableResult.observe(deltaELU.utilization);

It will:

- ensure there are no time gaps in the ELU measurements
- only call process.hrtime() once, as the second call (i.e. eventLoopUtilizationCollector(currentELU, this._lastValue)) will only perform a subtraction.
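
Put together as a self-contained sketch, the suggestion might look as follows. The class name `EluTracker` and the injectable `collect` parameter are hypothetical and exist only to keep the example runnable on its own; taking the baseline in the constructor is likewise an assumption of this sketch, and the actual instrumentation is structured differently.

```typescript
import { performance, type EventLoopUtilization } from 'node:perf_hooks';

type Collector = (
  util1?: EventLoopUtilization,
  util2?: EventLoopUtilization
) => EventLoopUtilization;

class EluTracker {
  private lastValue: EventLoopUtilization;
  private readonly collect: Collector;

  constructor(
    collect: Collector = (a, b) => performance.eventLoopUtilization(a, b)
  ) {
    this.collect = collect;
    // Assumption: take an absolute baseline as soon as the tracker is created.
    this.lastValue = collect();
  }

  // Returns the utilization since the previous call (or since construction).
  observe(): number {
    // One absolute reading per observation.
    const current = this.collect();
    // With two arguments the collector only subtracts the two readings.
    const delta = this.collect(current, this.lastValue);
    // Store the absolute reading, never the delta, as the next baseline.
    this.lastValue = current;
    return delta.utilization;
  }
}

const tracker = new EluTracker();
console.log(tracker.observe());
```

Because consecutive observations share the same absolute reading as their boundary, no interval is left unmeasured between them.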

mveroone · 2025-10-17T07:54:48Z

Hey @d4nyll ,
Thanks for taking the time to review this. That's a great catch, I had completely missed this (I'm really not accustomed to development in general, and TS/JS in particular).

Your solution also has the advantage of being far more self-explanatory and should confuse future readers less than the previous version.

I took the liberty of committing your suggestion, I hope that's fine by you?

d4nyll · 2025-10-21T22:06:51Z

@mveroone Hey! On my local branch I added the following test to packages/instrumentation-runtime-node/test/event_loop_utilization.test.ts to make sure we get this 100% right.

  it('should correctly calculate utilization deltas across multiple measurements', async function () {
    // This test ensures the bug where delta of deltas was observed instead of deltas of absolute values
    // does not regress. See https://github.com/open-telemetry/opentelemetry-js-contrib/pull/3118
    // This bug would surface on the third callback invocation.

    const instrumentation = new RuntimeNodeInstrumentation({});
    instrumentation.setMeterProvider(meterProvider);

    // Helper function to create blocking work that results in high utilization
    const createBlockingWork = (durationMs: number) => {
      const start = Date.now();
      while (Date.now() - start < durationMs) {
        // Busy wait to block the event loop
      }
    };

    // Helper function to collect metrics and extract utilization value
    const collectUtilization = async (): Promise<number> => {
      const { resourceMetrics } = await metricReader.collect();
      const scopeMetrics = resourceMetrics.scopeMetrics;
      const utilizationMetric = scopeMetrics[0].metrics.find(
        x => x.descriptor.name === METRIC_NODEJS_EVENTLOOP_UTILIZATION
      );

      assert.notEqual(utilizationMetric, undefined, 'metric not found');
      assert.strictEqual(utilizationMetric!.dataPoints.length, 1, 'expected one data point');

      return utilizationMetric!.dataPoints[0].value as number;
    };

    // Wait for some time to establish baseline utilization
    await new Promise(resolve => setTimeout(resolve, 200));

    // First collection
    const firstUtilization = await collectUtilization();
    assert.notStrictEqual(firstUtilization, 1, 'Expected utilization in first measurement to be not 1');

    // Second measurement: Create blocking work and measure
    createBlockingWork(50);
    const secondUtilization = await collectUtilization();
    assert.strictEqual(secondUtilization, 1, 'Expected utilization in second measurement to be 1');

    // Third measurement: Create blocking work again and measure
    // This is where the bug would manifest - if we were observing delta of deltas,
    // this measurement would not be 1
    createBlockingWork(50);
    const thirdUtilization = await collectUtilization();
    assert.strictEqual(thirdUtilization, 1, 'Expected utilization in third measurement to be 1');

    // Fourth measurement (should be the same as the third measurement, just a sanity check)
    createBlockingWork(50);
    const fourthUtilization = await collectUtilization();
    assert.strictEqual(fourthUtilization, 1, 'Expected utilization in fourth measurement to be 1');

    // Fifth measurement: Do some NON-blocking work (sanity check, should be low)
    await new Promise(resolve => setTimeout(resolve, 50));
    const fifthUtilization = await collectUtilization();
    assert.ok(fifthUtilization < 0.1, 'Expected utilization in fifth measurement to be less than 0.1');
  });

On close inspection, for my suggested code / your last commit (96ec7a4) to work on the first scrape, _lastValue can't be undefined. Otherwise, on the first scrape, const deltaELU = eventLoopUtilizationCollector(currentELU, this._lastValue); would effectively be const deltaELU = eventLoopUtilizationCollector(currentELU);, which gives the delta between the previous line (i.e. const currentELU = eventLoopUtilizationCollector();) and this line. The result is a very small deltaELU that looks something like deltaELU { idle: 0, active: 0.6307079792022705, utilization: 1 }, so the first scrape will always give a utilization of 1.

So I think the last thing we need to do here is:

1. Change the line private _lastValue?: EventLoopUtilization; to private _lastValue: EventLoopUtilization = eventLoopUtilizationCollector();. This should give these values in the test I wrote:
   ```
   firstUtilization 0.003609673196669925
   secondUtilization 1
   thirdUtilization 1
   fourthUtilization 1
   fifthUtilization 0.029487388913258965
   ```
   (Doing this does mean the utilization before EventLoopUtilizationCollector is initialized (~100ms) is lost, but I think that's fine. If someone really cares about the startup utilization, they can run EventLoopUtilization themselves at a specific point in their code where they deem startup is finished, and not rely on the metric being scraped)
2. Add the test to packages/instrumentation-runtime-node/test/event_loop_utilization.test.ts to prevent this from regressing
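
The first-scrape behaviour described above can be reproduced deterministically with a fake collector. This is only a sketch: `makeFakeCollector`, the 0.5 ms cost per reading, and the 199 ms idle period are all invented for illustration, and the argument handling is a simplified model of what this comment describes rather than the real perf_hooks code.

```typescript
type Elu = { idle: number; active: number; utilization: number };
type Collector = (util1?: Elu, util2?: Elu) => Elu;

const toElu = (idle: number, active: number): Elu => ({
  idle,
  active,
  utilization: idle + active > 0 ? active / (idle + active) : 0,
});

// Hypothetical stand-in for the collector, driven by a fake clock so the
// example is deterministic. Every new reading "costs" 0.5 ms of active time.
function makeFakeCollector(state: { idle: number; active: number }): Collector {
  return (util1, util2) => {
    // Two readings given: plain subtraction, no new reading is taken.
    if (util1 && util2) {
      return toElu(util1.idle - util2.idle, util1.active - util2.active);
    }
    state.active += 0.5;
    // No argument: absolute reading since the start.
    if (!util1) return toElu(state.idle, state.active);
    // One argument: delta between that reading and "now".
    return toElu(state.idle - util1.idle, state.active - util1.active);
  };
}

// The scrape logic from the suggestion, parameterized on the baseline.
function scrape(collect: Collector, lastValue?: Elu): number {
  const current = collect();
  return collect(current, lastValue).utilization;
}

// Case 1: no baseline, then 199 ms of idle time before the first scrape.
const stateA = { idle: 0, active: 0 };
const collectA = makeFakeCollector(stateA);
stateA.idle += 199;
const withoutBaseline = scrape(collectA, undefined);

// Case 2: baseline taken up front, then the same idle period.
const stateB = { idle: 0, active: 0 };
const collectB = makeFakeCollector(stateB);
const baseline = collectB();
stateB.idle += 199;
const withBaseline = scrape(collectB, baseline);

console.log({ withoutBaseline, withBaseline });
```

In this model the scrape without a baseline reports a utilization of 1 despite the long idle period, while the scrape with an up-front baseline reports a value well below 0.01.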

I appreciate this PR is turning into a bigger one than you first imagined, but thanks for working with me to get this bug squashed!

mveroone · 2025-10-24T08:04:51Z

Thanks a lot @d4nyll
This is awesome.

> I appreciate this PR is turning into a bigger one than you first imagined, but thanks for working with me to get this bug squashed!

No problem, I too prefer that we arrive at the right, robust and future-proof solution instead of a quick and simple one.

Is it fine if I commit both of your suggestions myself? I'm not accustomed to this community's traditions, and I definitely wouldn't want to rob you of attribution for your work.
Anyway, I'll try to do it on my side if only to learn how testing works here (being initially a sysadmin by trade, these are things I'm learning late), but I will await your response before pushing it to this branch.

EDIT: as per our private conversation, you're welcome to send a PR with the above suggestions against this branch and I'll gladly review it to the best of my ability.

d4nyll

Hey @mveroone I think it looks good. Just need to update the branch with the main branch, fix any conflicts, run the tests again (just to make sure) and we should be good 🙏

d4nyll

LGTM!

raphael-theriault-swi

Thank you for working on this !

d4nyll · 2025-10-29T09:33:40Z

I see the unit test failing for Node.js v18 with AssertionError [ERR_ASSERTION]: Expected utilization in fifth measurement to be less than 0.1. I'll look into it now.

david-luna · 2025-10-29T09:33:50Z

@d4nyll @mveroone

Tests are failing for nodejs v18. Could you have a look?

mveroone · 2025-10-31T13:07:24Z

> I see the unit test failing for Node.js v18 with AssertionError [ERR_ASSERTION]: Expected utilization in fifth measurement to be less than 0.1. I'll look into it now.

I have been trying to reproduce this but haven't had any luck, either by emulating the GHA tests with act or by running them locally against 18.0, 19.19 or 18.20.

Could it depend on the parallelization of tests by nx? Is there a guarantee that the event loop is dedicated to one test at a time while running? Otherwise it might get flaky depending on the test runtime environment.

d4nyll · 2025-10-31T13:29:35Z

> Otherwise it might get flaky depending on the test runtime environment.

With a5fe644 it would be very unlikely to be flaky, as the event loop would need to be completely busy for the entire 50ms we are waiting for it.

@mveroone can you update the branch with the base branch and @david-luna would you be able to run the tests again afterwards?

mveroone requested a review from a team as a code owner September 27, 2025 14:25

github-actions bot added the pkg:instrumentation-runtime-node label Sep 27, 2025

github-actions bot assigned d4nyll Sep 27, 2025

github-actions bot requested a review from d4nyll September 27, 2025 14:25

mveroone force-pushed the fix/runtime_metrics/elu branch 2 times, most recently from 3ac9601 to 8d0191d Compare October 2, 2025 15:20

d4nyll suggested changes Oct 27, 2025

packages/instrumentation-runtime-node/test/event_loop_utilization.test.ts (outdated, resolved)

packages/instrumentation-runtime-node/src/metrics/eventLoopUtilizationCollector.ts (outdated, resolved)

mveroone and others added 4 commits October 27, 2025 20:30

fix: use absolute results in eventLoopUtilization computation

0c9ba08

Improve solution from review

a747c1c

Improve solution for first iteration and add tests

c4763d9

Co-authored with @d4nyll

Comment change from review

8398e8f

Co-authored-by: Daniel Li <[email protected]>

mveroone force-pushed the fix/runtime_metrics/elu branch from 95cff5d to 8398e8f Compare October 27, 2025 19:30

Update tests after rebase

46677cc

d4nyll approved these changes Oct 27, 2025

d4nyll added the has:owner-approval Approved by Component Owner label Oct 27, 2025

raphael-theriault-swi approved these changes Oct 28, 2025

david-luna approved these changes Oct 29, 2025

david-luna changed the title ~~fix: use absolute results in eventLoopUtilization computation~~ fix(instrumentation-runtime-node) : use absolute results in eventLoopUtilization computation Oct 29, 2025

david-luna changed the title ~~fix(instrumentation-runtime-node) : use absolute results in eventLoopUtilization computation~~ fix(instrumentation-runtime-node): use absolute results in eventLoopUtilization computation Oct 29, 2025

Merge branch 'main' into fix/runtime_metrics/elu

0260d5d

d4nyll reviewed Oct 29, 2025

packages/instrumentation-runtime-node/test/event_loop_utilization.test.ts (resolved)

Update packages/instrumentation-runtime-node/test/event_loop_utilizat…

a5fe644

…ion.test.ts Co-authored-by: Daniel Li <[email protected]>
